Вам дана выгрузка событий из web-воронки со следующими типами событий:
- onboarding_start - начало прохождения воронки
- profile_start - пользователь начал заполнение анкеты
- email_submit - пользователь ввел свой почтовый адрес
- paywall_show - показ пэйвола
- payment_done - пользователь совершил покупку
Посчитайте какая доля пользователей от изначального проходит до каждого из этапов воронки и какая доля теряется на каждом из этапов. Используя событие experiment_exposure, выберете 3 наиболее перспективных теста, свой выбор обоснуйте.
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import math
from statsmodels.stats.proportion import proportions_ztest
Data Uploading¶
df = pd.read_csv("simple_interview_events.csv")
df
| user_id | event_type | event_time | event_params | |
|---|---|---|---|---|
| 0 | 32001 | onboarding_start | 2024-01-01T00:01:40 | {"funnel_type": "female"} |
| 1 | 99564 | onboarding_start | 2024-01-01T00:01:53 | {"funnel_type": "male"} |
| 2 | 32001 | profile_start | 2024-01-01T00:01:58 | {} |
| 3 | 99564 | profile_start | 2024-01-01T00:02:07 | {} |
| 4 | 71575 | onboarding_start | 2024-01-01T00:02:18 | {"funnel_type": "female"} |
| ... | ... | ... | ... | ... |
| 346323 | 54690 | paywall_show | 2024-04-01T00:01:40 | {} |
| 346324 | 11354 | paywall_show | 2024-04-01T00:02:51 | {} |
| 346325 | 64903 | paywall_show | 2024-04-01T00:04:27 | {} |
| 346326 | 41417 | paywall_show | 2024-04-01T00:05:01 | {} |
| 346327 | 66780 | paywall_show | 2024-04-01T00:05:25 | {} |
346328 rows × 4 columns
Data Preprocessing¶
unique_values_costs = {col: df[col].unique() for col in df.columns}
unique_values_costs
{'user_id': array([32001, 99564, 71575, ..., 64903, 41417, 66780], shape=(100000,)),
'event_type': array(['onboarding_start', 'profile_start', 'email_submit',
'paywall_show', 'payment_done', 'experiment_exposure'],
dtype=object),
'event_time': array(['2024-01-01T00:01:40', '2024-01-01T00:01:53',
'2024-01-01T00:01:58', ..., '2024-04-01T00:04:27',
'2024-04-01T00:05:01', '2024-04-01T00:05:25'],
shape=(314666,), dtype=object),
'event_params': array(['{"funnel_type": "female"}', '{"funnel_type": "male"}', '{}',
'{"funnel_type": "main"}',
'{"experiment_name": "exp_5", "experiment_group": "test"}',
'{"experiment_name": "exp_5", "experiment_group": "control"}',
'{"experiment_name": "exp_1", "experiment_group": "test"}',
'{"experiment_name": "exp_1", "experiment_group": "control"}',
'{"experiment_name": "exp_8", "experiment_group": "test"}',
'{"experiment_name": "exp_8", "experiment_group": "control"}',
'{"experiment_name": "exp_2", "experiment_group": "control"}',
'{"experiment_name": "exp_2", "experiment_group": "test"}',
'{"experiment_name": "exp_7", "experiment_group": "control"}',
'{"experiment_name": "exp_7", "experiment_group": "test"}',
'{"experiment_name": "exp_0", "experiment_group": "test"}',
'{"experiment_name": "exp_0", "experiment_group": "control"}',
'{"experiment_name": "exp_6", "experiment_group": "control"}',
'{"experiment_name": "exp_6", "experiment_group": "test"}',
'{"experiment_name": "exp_9", "experiment_group": "control"}',
'{"experiment_name": "exp_9", "experiment_group": "test"}'],
dtype=object)}
парсинг данных в event_params
df["funnel_type"] = df["event_params"].str.extract(r'"funnel_type":\s*"([^"]+)"')
df
| user_id | event_type | event_time | event_params | funnel_type | |
|---|---|---|---|---|---|
| 0 | 32001 | onboarding_start | 2024-01-01T00:01:40 | {"funnel_type": "female"} | female |
| 1 | 99564 | onboarding_start | 2024-01-01T00:01:53 | {"funnel_type": "male"} | male |
| 2 | 32001 | profile_start | 2024-01-01T00:01:58 | {} | NaN |
| 3 | 99564 | profile_start | 2024-01-01T00:02:07 | {} | NaN |
| 4 | 71575 | onboarding_start | 2024-01-01T00:02:18 | {"funnel_type": "female"} | female |
| ... | ... | ... | ... | ... | ... |
| 346323 | 54690 | paywall_show | 2024-04-01T00:01:40 | {} | NaN |
| 346324 | 11354 | paywall_show | 2024-04-01T00:02:51 | {} | NaN |
| 346325 | 64903 | paywall_show | 2024-04-01T00:04:27 | {} | NaN |
| 346326 | 41417 | paywall_show | 2024-04-01T00:05:01 | {} | NaN |
| 346327 | 66780 | paywall_show | 2024-04-01T00:05:25 | {} | NaN |
346328 rows × 5 columns
df["experiment_name"] = df["event_params"].str.extract(r'"experiment_name":\s*"([^"]+)"')
df
| user_id | event_type | event_time | event_params | funnel_type | experiment_name | |
|---|---|---|---|---|---|---|
| 0 | 32001 | onboarding_start | 2024-01-01T00:01:40 | {"funnel_type": "female"} | female | NaN |
| 1 | 99564 | onboarding_start | 2024-01-01T00:01:53 | {"funnel_type": "male"} | male | NaN |
| 2 | 32001 | profile_start | 2024-01-01T00:01:58 | {} | NaN | NaN |
| 3 | 99564 | profile_start | 2024-01-01T00:02:07 | {} | NaN | NaN |
| 4 | 71575 | onboarding_start | 2024-01-01T00:02:18 | {"funnel_type": "female"} | female | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 346323 | 54690 | paywall_show | 2024-04-01T00:01:40 | {} | NaN | NaN |
| 346324 | 11354 | paywall_show | 2024-04-01T00:02:51 | {} | NaN | NaN |
| 346325 | 64903 | paywall_show | 2024-04-01T00:04:27 | {} | NaN | NaN |
| 346326 | 41417 | paywall_show | 2024-04-01T00:05:01 | {} | NaN | NaN |
| 346327 | 66780 | paywall_show | 2024-04-01T00:05:25 | {} | NaN | NaN |
346328 rows × 6 columns
df["experiment_group"] = df["event_params"].str.extract(r'"experiment_group":\s*"([^"]+)"')
df
| user_id | event_type | event_time | event_params | funnel_type | experiment_name | experiment_group | |
|---|---|---|---|---|---|---|---|
| 0 | 32001 | onboarding_start | 2024-01-01T00:01:40 | {"funnel_type": "female"} | female | NaN | NaN |
| 1 | 99564 | onboarding_start | 2024-01-01T00:01:53 | {"funnel_type": "male"} | male | NaN | NaN |
| 2 | 32001 | profile_start | 2024-01-01T00:01:58 | {} | NaN | NaN | NaN |
| 3 | 99564 | profile_start | 2024-01-01T00:02:07 | {} | NaN | NaN | NaN |
| 4 | 71575 | onboarding_start | 2024-01-01T00:02:18 | {"funnel_type": "female"} | female | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 346323 | 54690 | paywall_show | 2024-04-01T00:01:40 | {} | NaN | NaN | NaN |
| 346324 | 11354 | paywall_show | 2024-04-01T00:02:51 | {} | NaN | NaN | NaN |
| 346325 | 64903 | paywall_show | 2024-04-01T00:04:27 | {} | NaN | NaN | NaN |
| 346326 | 41417 | paywall_show | 2024-04-01T00:05:01 | {} | NaN | NaN | NaN |
| 346327 | 66780 | paywall_show | 2024-04-01T00:05:25 | {} | NaN | NaN | NaN |
346328 rows × 7 columns
df = df.drop(columns=["event_params"])
df
| user_id | event_type | event_time | funnel_type | experiment_name | experiment_group | |
|---|---|---|---|---|---|---|
| 0 | 32001 | onboarding_start | 2024-01-01T00:01:40 | female | NaN | NaN |
| 1 | 99564 | onboarding_start | 2024-01-01T00:01:53 | male | NaN | NaN |
| 2 | 32001 | profile_start | 2024-01-01T00:01:58 | NaN | NaN | NaN |
| 3 | 99564 | profile_start | 2024-01-01T00:02:07 | NaN | NaN | NaN |
| 4 | 71575 | onboarding_start | 2024-01-01T00:02:18 | female | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 346323 | 54690 | paywall_show | 2024-04-01T00:01:40 | NaN | NaN | NaN |
| 346324 | 11354 | paywall_show | 2024-04-01T00:02:51 | NaN | NaN | NaN |
| 346325 | 64903 | paywall_show | 2024-04-01T00:04:27 | NaN | NaN | NaN |
| 346326 | 41417 | paywall_show | 2024-04-01T00:05:01 | NaN | NaN | NaN |
| 346327 | 66780 | paywall_show | 2024-04-01T00:05:25 | NaN | NaN | NaN |
346328 rows × 6 columns
тестовый юзер
df[df["user_id"] == 91875]
| user_id | event_type | event_time | funnel_type | experiment_name | experiment_group | |
|---|---|---|---|---|---|---|
| 34702 | 91875 | experiment_exposure | 2024-01-10T07:21:02 | NaN | exp_8 | test |
| 34703 | 91875 | onboarding_start | 2024-01-10T07:21:02 | male | NaN | NaN |
| 34705 | 91875 | profile_start | 2024-01-10T07:21:16 | NaN | NaN | NaN |
| 34708 | 91875 | email_submit | 2024-01-10T07:21:36 | NaN | NaN | NaN |
| 34717 | 91875 | paywall_show | 2024-01-10T07:28:28 | NaN | NaN | NaN |
заполняем NaN известными значениями
df = df.sort_values(["user_id", "event_time"])
df
| user_id | event_type | event_time | funnel_type | experiment_name | experiment_group | |
|---|---|---|---|---|---|---|
| 193355 | 0 | onboarding_start | 2024-02-20T16:21:51 | male | NaN | NaN |
| 193356 | 0 | profile_start | 2024-02-20T16:22:05 | NaN | NaN | NaN |
| 193357 | 0 | email_submit | 2024-02-20T16:22:25 | NaN | NaN | NaN |
| 193375 | 0 | paywall_show | 2024-02-20T16:29:17 | NaN | NaN | NaN |
| 317565 | 1 | onboarding_start | 2024-03-24T07:48:11 | main | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 86357 | 99998 | paywall_show | 2024-01-23T07:49:26 | NaN | NaN | NaN |
| 300974 | 99999 | onboarding_start | 2024-03-20T00:41:12 | female | NaN | NaN |
| 300975 | 99999 | profile_start | 2024-03-20T00:41:30 | NaN | NaN | NaN |
| 300976 | 99999 | email_submit | 2024-03-20T00:41:55 | NaN | NaN | NaN |
| 300999 | 99999 | paywall_show | 2024-03-20T00:49:25 | NaN | NaN | NaN |
346328 rows × 6 columns
df[['experiment_name', 'experiment_group', 'funnel_type']] = (
df.groupby('user_id')[['experiment_name', 'experiment_group', 'funnel_type']]
.transform(lambda x: x.ffill().bfill())
)
/var/folders/cj/5yw5vpbx3w793bdrt43m_7zm0000gn/T/ipykernel_76229/182570417.py:3: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
.transform(lambda x: x.ffill().bfill())
проверка
df[df["user_id"] == 91875]
| user_id | event_type | event_time | funnel_type | experiment_name | experiment_group | |
|---|---|---|---|---|---|---|
| 34702 | 91875 | experiment_exposure | 2024-01-10T07:21:02 | male | exp_8 | test |
| 34703 | 91875 | onboarding_start | 2024-01-10T07:21:02 | male | exp_8 | test |
| 34705 | 91875 | profile_start | 2024-01-10T07:21:16 | male | exp_8 | test |
| 34708 | 91875 | email_submit | 2024-01-10T07:21:36 | male | exp_8 | test |
| 34717 | 91875 | paywall_show | 2024-01-10T07:28:28 | male | exp_8 | test |
удаление строк с funnel_type из-за парсинга experiment_name experiment_group
смотрим есть ли такие пользователи, что у них единственная строка - это строка funnel_type = NaN, чтобы знать можно ли удалить
users_with_nan = df[df["funnel_type"].isna()]["user_id"]
user_event_counts = df["user_id"].value_counts()
# пользователи, у которых только одна строка в датасете
single_event_users = user_event_counts[user_event_counts == 1].index
# пересекаем с теми, у кого эта строка — с NaN funnel_type
users_with_only_nan_row = users_with_nan[users_with_nan.isin(single_event_users)].unique()
print(f"Таких пользователей: {len(users_with_only_nan_row)}")
print(users_with_only_nan_row)
Таких пользователей: 0 []
нет таких, удаляем
df = df[df['event_type'] != 'experiment_exposure'].fillna(0)
df
| user_id | event_type | event_time | funnel_type | experiment_name | experiment_group | |
|---|---|---|---|---|---|---|
| 193355 | 0 | onboarding_start | 2024-02-20T16:21:51 | male | 0 | 0 |
| 193356 | 0 | profile_start | 2024-02-20T16:22:05 | male | 0 | 0 |
| 193357 | 0 | email_submit | 2024-02-20T16:22:25 | male | 0 | 0 |
| 193375 | 0 | paywall_show | 2024-02-20T16:29:17 | male | 0 | 0 |
| 317565 | 1 | onboarding_start | 2024-03-24T07:48:11 | main | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 86357 | 99998 | paywall_show | 2024-01-23T07:49:26 | female | exp_8 | test |
| 300974 | 99999 | onboarding_start | 2024-03-20T00:41:12 | female | 0 | 0 |
| 300975 | 99999 | profile_start | 2024-03-20T00:41:30 | female | 0 | 0 |
| 300976 | 99999 | email_submit | 2024-03-20T00:41:55 | female | 0 | 0 |
| 300999 | 99999 | paywall_show | 2024-03-20T00:49:25 | female | 0 | 0 |
321186 rows × 6 columns
df[df["user_id"] == 91875]
| user_id | event_type | event_time | funnel_type | experiment_name | experiment_group | |
|---|---|---|---|---|---|---|
| 34703 | 91875 | onboarding_start | 2024-01-10T07:21:02 | male | exp_8 | test |
| 34705 | 91875 | profile_start | 2024-01-10T07:21:16 | male | exp_8 | test |
| 34708 | 91875 | email_submit | 2024-01-10T07:21:36 | male | exp_8 | test |
| 34717 | 91875 | paywall_show | 2024-01-10T07:28:28 | male | exp_8 | test |
Funnel¶
Вам дана выгрузка событий из web-воронки со следующими типами событий:
- onboarding_start - начало прохождения воронки
- profile_start - пользователь начал заполнение анкеты
- email_submit - пользователь ввел свой почтовый адрес
- paywall_show - показ пэйвола
- payment_done - пользователь совершил покупку
Посчитайте какая доля пользователей от изначального проходит до каждого из этапов воронки и какая доля теряется на каждом из этапов.
подготовка¶
funnel_order = [
'onboarding_start',
'profile_start',
'email_submit',
'paywall_show',
'payment_done'
]
def calculate_funnel_metrics(df, funnel_order, group_by_list):
# считаем пользователей
users_by_step = df.groupby(group_by_list)["user_id"].nunique().unstack(fill_value=0)
# Упорядочек
users_by_step = users_by_step[funnel_order]
# Конверсии
conv_from_start = users_by_step.div(users_by_step[funnel_order[0]], axis=0) * 100
# Потери
drop_offs = users_by_step.pct_change(axis=1).fillna(1) - 1
drop_offs = -drop_offs * 100
conv_from_start = conv_from_start.round(3)
drop_offs = drop_offs.round(3)
# Объединение
result = []
for funnel in users_by_step.index:
for stage in funnel_order:
result.append({
'group_by': funnel,
'event_type': stage,
'users': users_by_step.loc[funnel, stage],
'conversion_from_start': conv_from_start.loc[funnel, stage],
'drop_off_from_previous': drop_offs.loc[funnel, stage]
})
return pd.DataFrame(result)
по funnel_type¶
by_event_type = calculate_funnel_metrics(df, funnel_order, group_by_list = ["funnel_type", "event_type"])
by_event_type
| group_by | event_type | users | conversion_from_start | drop_off_from_previous | |
|---|---|---|---|---|---|
| 0 | female | onboarding_start | 54987 | 100.000 | -0.000 |
| 1 | female | profile_start | 46662 | 84.860 | 115.140 |
| 2 | female | email_submit | 41967 | 76.322 | 110.062 |
| 3 | female | paywall_show | 39555 | 71.935 | 105.747 |
| 4 | female | payment_done | 2887 | 5.250 | 192.701 |
| 5 | main | onboarding_start | 10027 | 100.000 | -0.000 |
| 6 | main | profile_start | 8326 | 83.036 | 116.964 |
| 7 | main | email_submit | 575 | 5.735 | 193.094 |
| 8 | main | paywall_show | 7897 | 78.757 | -1173.391 |
| 9 | main | payment_done | 683 | 6.812 | 191.351 |
| 10 | male | onboarding_start | 34986 | 100.000 | -0.000 |
| 11 | male | profile_start | 28063 | 80.212 | 119.788 |
| 12 | male | email_submit | 22685 | 64.840 | 119.164 |
| 13 | male | paywall_show | 20026 | 57.240 | 111.721 |
| 14 | male | payment_done | 1860 | 5.316 | 190.712 |
for event_type in by_event_type['group_by'].unique():
current_df = by_event_type[by_event_type['group_by'] == event_type]
data = dict(
number=current_df['users'].tolist(),
stage=current_df['event_type'].tolist(),
text=[f"{int(x)}%" for x in current_df['conversion_from_start']]
)
fig = px.funnel(data, x='number', y='stage', text='text')
fig.update_traces(textposition='inside')
fig.update_layout(title=f"Общая воронка регистрации для funnel_type = {event_type}",
width=800,
height=500)
fig.show()
Нет кучи событий по email_submit, считая что все этапы для воронки важны, можно предположить, что там данные пропали/нет информации
Около 10 процентов пользователей с таким встретились, поэтому можно их удалить ИЛИ оставить только тех, про кого есть информация, но тогда будет 100% проходимость
поэтому удаляем
df = df[df['funnel_type'] != 'main']
df
| user_id | event_type | event_time | funnel_type | experiment_name | experiment_group | |
|---|---|---|---|---|---|---|
| 193355 | 0 | onboarding_start | 2024-02-20T16:21:51 | male | 0 | 0 |
| 193356 | 0 | profile_start | 2024-02-20T16:22:05 | male | 0 | 0 |
| 193357 | 0 | email_submit | 2024-02-20T16:22:25 | male | 0 | 0 |
| 193375 | 0 | paywall_show | 2024-02-20T16:29:17 | male | 0 | 0 |
| 5818 | 2 | onboarding_start | 2024-01-02T16:12:34 | female | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 86357 | 99998 | paywall_show | 2024-01-23T07:49:26 | female | exp_8 | test |
| 300974 | 99999 | onboarding_start | 2024-03-20T00:41:12 | female | 0 | 0 |
| 300975 | 99999 | profile_start | 2024-03-20T00:41:30 | female | 0 | 0 |
| 300976 | 99999 | email_submit | 2024-03-20T00:41:55 | female | 0 | 0 |
| 300999 | 99999 | paywall_show | 2024-03-20T00:49:25 | female | 0 | 0 |
293678 rows × 6 columns
по experiment_name¶
by_experiment = calculate_funnel_metrics(df[df["experiment_name"] != 0], funnel_order, group_by_list = ["experiment_name", "experiment_group", "event_type"])
by_experiment[['experiment_name', 'group']] = by_experiment['group_by'].apply(pd.Series)
by_experiment
| group_by | event_type | users | conversion_from_start | drop_off_from_previous | experiment_name | group | |
|---|---|---|---|---|---|---|---|
| 0 | (exp_0, control) | onboarding_start | 1371 | 100.000 | -0.000 | exp_0 | control |
| 1 | (exp_0, control) | profile_start | 1146 | 83.589 | 116.411 | exp_0 | control |
| 2 | (exp_0, control) | email_submit | 994 | 72.502 | 113.264 | exp_0 | control |
| 3 | (exp_0, control) | paywall_show | 911 | 66.448 | 108.350 | exp_0 | control |
| 4 | (exp_0, control) | payment_done | 83 | 6.054 | 190.889 | exp_0 | control |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | (exp_9, test) | onboarding_start | 1387 | 100.000 | -0.000 | exp_9 | test |
| 76 | (exp_9, test) | profile_start | 1218 | 87.815 | 112.185 | exp_9 | test |
| 77 | (exp_9, test) | email_submit | 1034 | 74.549 | 115.107 | exp_9 | test |
| 78 | (exp_9, test) | paywall_show | 956 | 68.926 | 107.544 | exp_9 | test |
| 79 | (exp_9, test) | payment_done | 74 | 5.335 | 192.259 | exp_9 | test |
80 rows × 7 columns
пс: на графики можно навести и там будут цифры
for exp in by_experiment['experiment_name'].unique():
current_df = by_experiment[by_experiment['experiment_name'] == exp]
fig = px.funnel(
current_df,
x='users',
y='event_type',
color='group', # control / test
text=current_df['conversion_from_start'].apply(lambda x: f"{x:.0f}%")
)
fig.update_traces(textposition='inside')
fig.update_layout(
title=f"Воронка регистрации для {exp}",
xaxis_title="Конверсия от старта (%)",
yaxis_title="Этап",
width=800,
height=500
)
fig.show()
по experiment_name и funnel_type¶
by_everything = calculate_funnel_metrics(df[df["experiment_name"] != 0], funnel_order, group_by_list = ["funnel_type", "experiment_name", "experiment_group", "event_type"])
by_everything[['funnel_type', 'experiment_name', 'group']] = by_everything['group_by'].apply(pd.Series)
by_everything
| group_by | event_type | users | conversion_from_start | drop_off_from_previous | funnel_type | experiment_name | group | |
|---|---|---|---|---|---|---|---|---|
| 0 | (female, exp_0, control) | onboarding_start | 832 | 100.000 | -0.000 | female | exp_0 | control |
| 1 | (female, exp_0, control) | profile_start | 710 | 85.337 | 114.663 | female | exp_0 | control |
| 2 | (female, exp_0, control) | email_submit | 635 | 76.322 | 110.563 | female | exp_0 | control |
| 3 | (female, exp_0, control) | paywall_show | 602 | 72.356 | 105.197 | female | exp_0 | control |
| 4 | (female, exp_0, control) | payment_done | 54 | 6.490 | 191.030 | female | exp_0 | control |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 155 | (male, exp_9, test) | onboarding_start | 546 | 100.000 | -0.000 | male | exp_9 | test |
| 156 | (male, exp_9, test) | profile_start | 461 | 84.432 | 115.568 | male | exp_9 | test |
| 157 | (male, exp_9, test) | email_submit | 361 | 66.117 | 121.692 | male | exp_9 | test |
| 158 | (male, exp_9, test) | paywall_show | 312 | 57.143 | 113.573 | male | exp_9 | test |
| 159 | (male, exp_9, test) | payment_done | 21 | 3.846 | 193.269 | male | exp_9 | test |
160 rows × 8 columns
for exp in by_everything['experiment_name'].unique():
df_exp = by_everything[by_everything['experiment_name'] == exp]
for funnel_type in df_exp['funnel_type'].unique():
df_sub = df_exp[df_exp['funnel_type'] == funnel_type]
fig = px.funnel(
df_sub,
x='users',
y='event_type',
color='group', # control / test
text=df_sub['conversion_from_start'].apply(lambda x: f"{x:.0f}%")
)
fig.update_traces(textposition='inside')
fig.update_layout(
title=f"Воронка для {exp}, funnel_type = {funnel_type}",
xaxis_title="Конверсия от старта (%)",
yaxis_title="Этап",
width=800,
height=500
)
fig.show()
сравниваем¶
df_cr = by_experiment[by_experiment['event_type'] == 'payment_done']
results_ztest = []
for exp in df_cr['experiment_name'].unique():
df_exp = df_cr[df_cr['experiment_name'] == exp]
if {'control', 'test'}.issubset(df_exp['group'].unique()):
control_row = df_exp[df_exp['group'] == 'control'].iloc[0]
test_row = df_exp[df_exp['group'] == 'test'].iloc[0]
control_n = int(control_row['users'])
test_n = int(test_row['users'])
control_conv = control_row['conversion_from_start']
test_conv = test_row['conversion_from_start']
uplift = (test_conv - control_conv) / control_conv * 100
# success (платежи)
count = [
control_conv,
test_conv
]
nobs = [test_n, control_n]
stat, p_value = proportions_ztest(count, nobs)
results_ztest.append({
'experiment_name': exp,
'control_conv': round(control_conv, 2),
'test_conv': round(test_conv, 2),
'uplift_percent': round(uplift, 2),
'p_value': round(p_value, 4)
})
results_ztest_df = pd.DataFrame(results_ztest).sort_values(by='uplift_percent', ascending=True).reset_index(drop=True)
results_ztest_df
| experiment_name | control_conv | test_conv | uplift_percent | p_value | |
|---|---|---|---|---|---|
| 0 | exp_8 | 6.13 | 5.15 | -15.95 | 0.5501 |
| 1 | exp_0 | 6.05 | 5.40 | -10.72 | 0.6169 |
| 2 | exp_1 | 6.99 | 6.84 | -2.25 | 0.9416 |
| 3 | exp_6 | 7.54 | 7.94 | 5.27 | 0.8284 |
| 4 | exp_5 | 7.52 | 7.98 | 6.10 | 0.7188 |
| 5 | exp_7 | 7.48 | 7.99 | 6.72 | 0.7620 |
| 6 | exp_9 | 4.71 | 5.34 | 13.37 | 0.7300 |
| 7 | exp_2 | 7.56 | 10.21 | 35.04 | 0.1905 |
df_cr = by_everything[by_everything['event_type'] == 'payment_done']
results_ztest = []
# Группируем по experiment_name и funnel_type
for (exp, funnel) in df_cr[['experiment_name', 'funnel_type']].drop_duplicates().itertuples(index=False):
df_exp = df_cr[(df_cr['experiment_name'] == exp) & (df_cr['funnel_type'] == funnel)]
if {'control', 'test'}.issubset(df_exp['group'].unique()):
control_row = df_exp[df_exp['group'] == 'control'].iloc[0]
test_row = df_exp[df_exp['group'] == 'test'].iloc[0]
control_n = int(control_row['users'])
test_n = int(test_row['users'])
control_conv = control_row['conversion_from_start']
test_conv = test_row['conversion_from_start']
count = [
control_conv,
test_conv
]
nobs = [test_n, control_n]
stat, pval = proportions_ztest(count, nobs)
uplift = ((test_conv - control_conv) / control_conv) * 100
results_ztest.append({
'experiment_name': exp,
'funnel_type': funnel,
'control_conv': round(control_conv, 2),
'test_conv': round(test_conv, 2),
'uplift_percent': round(uplift, 2),
'p_value': round(pval, 4)
})
results_ztest_df = pd.DataFrame(results_ztest).sort_values(by='p_value', ascending=True).reset_index(drop=True)
results_ztest_df
| experiment_name | funnel_type | control_conv | test_conv | uplift_percent | p_value | |
|---|---|---|---|---|---|---|
| 0 | exp_7 | male | 7.33 | 11.52 | 57.25 | 0.0061 |
| 1 | exp_6 | male | 6.95 | 9.60 | 38.13 | 0.1096 |
| 2 | exp_2 | female | 6.49 | 9.21 | 41.84 | 0.1702 |
| 3 | exp_8 | male | 6.51 | 4.49 | -30.96 | 0.1707 |
| 4 | exp_9 | female | 4.23 | 6.30 | 48.91 | 0.2061 |
| 5 | exp_2 | male | 9.70 | 12.01 | 23.83 | 0.2121 |
| 6 | exp_9 | male | 5.48 | 3.85 | -29.87 | 0.2268 |
| 7 | exp_7 | female | 7.57 | 6.00 | -20.66 | 0.3227 |
| 8 | exp_5 | female | 5.34 | 6.72 | 25.67 | 0.3341 |
| 9 | exp_0 | female | 6.49 | 5.32 | -18.00 | 0.4333 |
| 10 | exp_1 | male | 6.83 | 8.15 | 19.30 | 0.5253 |
| 11 | exp_5 | male | 11.98 | 10.38 | -13.37 | 0.5314 |
| 12 | exp_1 | female | 7.08 | 6.17 | -12.90 | 0.6507 |
| 13 | exp_6 | female | 7.85 | 7.12 | -9.33 | 0.7004 |
| 14 | exp_8 | female | 5.89 | 5.56 | -5.70 | 0.8599 |
| 15 | exp_0 | male | 5.38 | 5.54 | 3.05 | 0.9392 |
Выводы¶
exp_2 - самый переспективный overall: максимально низкое p-value среди всех, uplift_percent наивысши. Эффект размыт за счёт смешивания male/female в одну метрику, но и по отдельности они ведут себя хорошо, в обеих группах результат положительный (аплифт +23.83% и +41.84%)
exp_7 male - самый переспективный для раскатки, НО только на мужскую аудиторию приложения: высокий рост конверсии при высокой достоверности результата, Uplift: +57.25%, для женской аудитории спад -20.66%, но overall результат положительный 6.72%
exp_9 несмотря на то, что в exp_6 male лучше (а для женщин ниже), overall результат на две аудитории хуже, чем у exp_9 (тут наоборот у женщин лучше) учитвая то, что p-value exp_9 < p_value exp_6 и приложение расчитано больше на женскую аудиторию, то этот тест становится наилучшим выбором из двоих
